Extracting the Lowest Frequency Words: Pitfalls and Possibilities

Authors

  • Marc Weeber
  • Rein Vos
  • R. Harald Baayen
Abstract

s or to the complete newspaper corpus. This raises the question of whether better results might have been obtained if the complete data sets had been used. In principle, more data might imply more power. At the same time, more data also entails the risk of more noise. At least for our af data, enlarging the complement leads to worse performance. When we allow any sentence that contains af in our analyses, F decreases from 0.31 to 0.23 for G². When we base the analyses on the complete newspaper corpus, F drops further to 0.19. This decrease in performance is probably due to the W/C ratio being very low for all practical window sizes, i.e., at the very left part of the saw-tooth-shaped pattern characterizing Nsig as a function of W/C. Consequently, any low-frequency word is singled out as a significant item whenever it occurs at least once in the window. Given the Zipfian structure of word-frequency distributions, a great many spurious low-frequency words are extracted. As mentioned in the introduction, the received wisdom is that the windowing method is unreliable for events with a frequency of less than 5. By means of an analysis of the behavior of statistical tests for 2 × 2 contingency tables with sparse data, a method for optimizing the use of these tests has been developed. We hope that this technique will prove to be useful for domains in which the extraction of low-probability events is crucial.
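The effect of a very low window-to-complement ratio can be illustrated with a small sketch. The code below computes the log-likelihood ratio statistic G² for a 2 × 2 contingency table (word versus all other tokens, window versus complement) and compares it with the standard chi-squared critical value for one degree of freedom. The function name `g2` and all token counts are hypothetical and chosen only for illustration; this is not the authors' implementation, and the counts are not taken from their data.

```python
import math

# Chi-squared critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_05 = 3.841


def g2(k_window, n_window, k_complement, n_complement):
    """Log-likelihood ratio G^2 for a 2 x 2 table.

    k_window / k_complement: occurrences of the target word
    n_window / n_complement: total tokens in window / complement
    (all counts here are hypothetical illustration values)
    """
    observed = [
        [k_window, n_window - k_window],
        [k_complement, n_complement - k_complement],
    ]
    total = n_window + n_complement
    row_totals = [n_window, n_complement]
    col_totals = [k_window + k_complement,
                  total - k_window - k_complement]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                e = row_totals[i] * col_totals[j] / total
                stat += o * math.log(o / e)
    return 2.0 * stat


# A word seen once in a 1,000-token window and never in the complement.
small_ratio = g2(1, 1_000, 0, 1_000_000)  # W/C = 0.001
equal_sizes = g2(1, 1_000, 0, 1_000)      # W/C = 1

print(small_ratio > CRITICAL_05)  # single occurrence flagged as significant
print(equal_sizes > CRITICAL_05)  # same occurrence, not flagged
```

With these assumed counts, a single occurrence in the window exceeds the critical value once the complement is a thousand times larger than the window, but not when window and complement are the same size. This matches the pattern described above: at very low W/C, one occurrence is enough for a word to be extracted.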

Similar resources

A Supervised Approach to Keyword Extraction from Persian Documents Using Lexical Chains

Keywords are the main focal points of interest within a text, which intends to represent the principal concepts outlined in the document. Determining the keywords using traditional methods is a time consuming process and requires specialized knowledge of the subject. For the purposes of indexing the vast expanse of electronic documents, it is important to automate the keyword extraction task. S...

High Wire Act: the Perils, Pitfalls and Possibilities of Online Discussions

Online discussions are an important component of both blended and online courses. This paper examines the varieties of online discussions and the perils, pitfalls and possibilities of this rather new technological tool for enhanced learning. The discussion begins with possible perils and pitfalls inherent in this educational tool and moves to a consideration of the advantages of the varieties o...

Improving Precision of Keywords Extracted From Persian Text Using Word2Vec Algorithm

Keywords can present the main concepts of the text without human intervention according to the model. Keywords are important vocabulary words that describe the text and play a very important role in accurate and fast understanding of the content. The purpose of extracting keywords is to identify the subject of the text and the main content of the text in the shortest time. Keyword extraction pl...

IMPACTS AND CHALLENGES OF CLOUD COMPUTING FOR SMALL AND MEDIUM SCALE BUSINESSES IN NIGERIA

Cloud computing technology is providing businesses, be it micro, small, medium, and large scale enterprises with the same level playing grounds. Small and Medium enterprises (SMEs) that have adopted the cloud are taking their businesses to greater heights with the competitive edge that cloud computing offers. The limitations faced by (SMEs) in procuring and maintaining IT infrastructures has be...

The Contribution of General High-Frequency, Core-Academic, and Academic-Technical Words to ESP Reading Comprehension

Reading is recognized as being the most important skill needed by ESP learners in their field of study, and vocabulary knowledge is the most widely discussed component of effective ESP reading per se. However, research on how much the different types of words exert substantial influences over ESP reading comprehension remains scanty. To address this lacuna, the present study aimed to examine th...

Journal:
  • Computational Linguistics

Volume 26

Publication date: 2000